Bayesian Distributed Stochastic Gradient Descent

Michael Teng, Frank Wood

Neural Information Processing Systems

We introduce Bayesian distributed stochastic gradient descent (BDSGD), a high-throughput algorithm for training deep neural networks on parallel computing clusters. This algorithm uses amortized inference in a deep generative model to perform joint posterior predictive inference of mini-batch gradient computation times in a compute-cluster-specific manner. Specifically, our algorithm mitigates the straggler effect in synchronous, gradient-based optimization by choosing an optimal cutoff beyond which mini-batch gradient messages from slow workers are ignored. The principal novel contribution and finding of this work goes beyond this by demonstrating that using the predicted run-times from a generative model of cluster worker performance improves over the static-cutoff prior art, leading to higher gradient computation throughput on large compute clusters. In our experiments we show that eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but sometimes also increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness.
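The cutoff mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's method: the `straggler_cutoff` helper below simply takes a quantile of predicted run-times, whereas BDSGD derives its predictions from amortized inference in a deep generative model of cluster performance.

```python
import numpy as np

def straggler_cutoff(predicted_times, target_fraction=0.9):
    # Hypothetical helper: pick a cutoff time so that roughly
    # `target_fraction` of workers are expected to finish in time.
    return float(np.quantile(predicted_times, target_fraction))

def aggregate_gradients(gradients, arrival_times, cutoff):
    # Average only the mini-batch gradients that arrive before the cutoff;
    # stragglers' contributions are dropped to avoid idle waiting.
    kept = [g for g, t in zip(gradients, arrival_times) if t <= cutoff]
    return np.mean(kept, axis=0), len(kept)

# Toy example: 8 workers, one slow straggler (times in seconds, simulated;
# here the "actual" times equal the predictions for simplicity).
predicted = np.array([1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.95, 5.0])
gradients = [np.ones(3) * i for i in range(8)]
cutoff = straggler_cutoff(predicted, target_fraction=0.9)
avg_grad, n_used = aggregate_gradients(gradients, predicted, cutoff)
print(n_used)  # the slowest worker's gradient is discarded
```

The design point is that the cutoff trades a small amount of gradient information for the elimination of synchronization stalls, which is where the throughput gain comes from.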




Meta builds world's largest AI superclusters for the future

FOX News

The CyberGuy Kurt Knutsson joins 'Fox & Friends' to discuss the U.S.-Saudi investment summit and the debate over regulation as artificial intelligence continues to advance. What happens when one of the world's richest companies decides to go all-in on artificial intelligence? If you're Meta Platforms CEO Mark Zuckerberg, it means launching superclusters so large they could rival the footprint of Manhattan. Recently, Zuckerberg unveiled plans to invest "hundreds of billions of dollars" into next-generation AI infrastructure, including some of the largest compute clusters the world has ever seen. Meta's first supercluster, called Prometheus, is slated to go live in 2026.


VEXP: A Low-Cost RISC-V ISA Extension for Accelerated Softmax Computation in Transformers

Wang, Run, Islamoglu, Gamze, Belano, Andrea, Potocnik, Viviane, Conti, Francesco, Garofalo, Angelo, Benini, Luca

arXiv.org Artificial Intelligence

While Transformers are dominated by Floating-Point (FP) Matrix-Multiplications, their aggressive acceleration through dedicated hardware or many-core programmable systems has shifted the performance bottleneck to non-linear functions like Softmax. Accelerating Softmax is challenging due to its non-pointwise, non-linear nature, with exponentiation as the most demanding step. To address this, we design a custom arithmetic block for Bfloat16 exponentiation leveraging a novel approximation algorithm based on Schraudolph's method, and we integrate it into the Floating-Point Unit (FPU) of the RISC-V cores [1] of a compute cluster, through custom Instruction Set Architecture (ISA) extensions, with a negligible area overhead of 1%. By optimizing the software kernels to leverage the extension, we execute Softmax with 162.7x lower latency and 74.3x less energy compared to the baseline cluster, achieving an 8.2x performance improvement and 4.1x higher energy efficiency for the FlashAttention-2 kernel in GPT-2 configuration. Moreover, the proposed approach enables a multi-cluster system to efficiently execute end-to-end inference of pre-trained Transformer models, such as GPT-2, GPT-3 and ViT, achieving up to 5.8x and 3.6x reductions in latency and energy consumption, respectively, without requiring re-training and with negligible accuracy loss. Transformer-based models such as the GPT family [2] and the LLaMa family [3] have emerged as a cornerstone of machine learning, demonstrating state-of-the-art performance in diverse domains, including natural language processing (NLP), computer vision, and audio processing. At the core of their success is the Transformer architecture [4], which utilizes the self-attention mechanism to model complex relationships within input sequences.
In encoders and the prefill stage of decoders, the computational complexity of attention layers scales quadratically with the input sequence length, leading to memory and computational overheads that necessitate mitigation by means of dedicated acceleration. This work was supported by the NeuroSoC project, funded under the European Union's Horizon Europe research and innovation programme (Grant Agreement No. 101070634).


Why Europe's Efforts to Gain AI Autonomy Might Be Too Little Too Late

TIME - Tech

This week Microsoft announced that it would invest €3.2 billion ($3.5 billion) in Germany over the next two years. The U.S. tech giant will use the money to double the capacity of its artificial intelligence and data center infrastructure in Germany and expand its training programmes, according to Microsoft vice chair and president Brad Smith. The move follows a similar announcement from November 2023, when Microsoft said it would invest £2.5 billion ($3.2 billion) in infrastructure in the U.K. over the next three years. Both countries hailed the investments as significant steps that would permit them to compete on the world stage when it comes to AI. However, the investments are dwarfed by investments made by U.S.-based cloud service providers elsewhere, particularly in the U.S. As AI becomes increasingly economically and militarily important, governments are taking steps to ensure they have control over the technology that they depend on.


Compute at Scale: A Broad Investigation into the Data Center Industry

Pilz, Konstantin, Heim, Lennart

arXiv.org Artificial Intelligence

This report characterizes the data center industry and its importance for AI development. Data centers are industrial facilities that efficiently provide compute at scale and thus constitute the engine rooms of today's digital economy. As large-scale AI training and inference become increasingly computationally expensive, they are predominantly executed on this dedicated infrastructure. Key features of data centers include large-scale compute clusters that require extensive cooling and consume large amounts of power, the need for fast connectivity both within the data center and to the internet, and an emphasis on security and reliability. The global industry is valued at approximately $250B and is expected to double over the next seven years. There are likely about 500 large (above 10 MW) data centers globally, with the US, Europe, and China constituting the most important markets. The report further covers important actors, business models, main inputs, and typical locations of data centers.


Q-EEGNet: an Energy-Efficient 8-bit Quantized Parallel EEGNet Implementation for Edge Motor-Imagery Brain--Machine Interfaces

Schneider, Tibor, Wang, Xiaying, Hersche, Michael, Cavigelli, Lukas, Benini, Luca

arXiv.org Artificial Intelligence

Motor-Imagery Brain--Machine Interfaces (MI-BMIs) promise direct and accessible communication between human brains and machines by analyzing brain activities recorded with Electroencephalography (EEG). Latency, reliability, and privacy constraints make it unsuitable to offload the computation to the cloud. Practical use cases demand a wearable, battery-operated device with low average power consumption for long-term use. Recently, sophisticated algorithms, in particular deep learning models, have emerged for classifying EEG signals. While reaching outstanding accuracy, these models often exceed the limitations of edge devices due to their memory and computational requirements. In this paper, we demonstrate algorithmic and implementation optimizations for EEGNET, a compact Convolutional Neural Network (CNN) suitable for many BMI paradigms. We quantize weights and activations to 8-bit fixed-point with a negligible accuracy loss of 0.4% on 4-class MI, and present an energy-efficient hardware-aware implementation on the Mr. Wolf parallel ultra-low power (PULP) System-on-Chip (SoC) by utilizing its custom RISC-V ISA extensions and 8-core compute cluster. With our proposed optimization steps, we can obtain an overall speedup of 64x and a reduction of up to 85% in memory footprint with respect to a single-core layer-wise baseline implementation. Our implementation takes only 5.82 ms and consumes 0.627 mJ per inference. With 21.0 GMAC/s/W, it is 256x more energy-efficient than an EEGNET implementation on an ARM Cortex-M7 (0.082 GMAC/s/W).
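The 8-bit fixed-point step can be illustrated with a symmetric per-tensor quantizer, a common scheme for this kind of deployment; this is a generic sketch under that assumption, not necessarily the exact scheme used in the paper.

```python
import numpy as np

def quantize_int8(x, scale):
    # Symmetric 8-bit fixed-point quantization: real value ≈ q * scale,
    # with q confined to the signed 8-bit range [-128, 127].
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    # Map the int8 codes back to approximate real values.
    return q.astype(np.float32) * scale

# Toy weight tensor; the scale is derived from the tensor's dynamic range.
w = np.array([-0.51, 0.0, 0.25, 0.49], dtype=np.float32)
scale = np.max(np.abs(w)) / 127.0
q = quantize_int8(w, scale)
w_hat = dequantize(q, scale)
print(np.max(np.abs(w - w_hat)))  # rounding error is bounded by scale/2
```

Because the int8 codes fit four-to-a-word, a RISC-V core with packed-SIMD ISA extensions can process multiple multiply-accumulates per instruction, which is where the reported speedup and memory savings come from.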


Stability AI builds foundation models on Amazon SageMaker

#artificialintelligence

We're thrilled to announce that Stability AI has selected AWS as its preferred cloud provider to power its state-of-the-art AI models for image, language, audio, video, and 3D content generation. Stability AI is a community-driven, open-source artificial intelligence (AI) company developing breakthrough technologies. With Amazon SageMaker, Stability AI will build AI models on compute clusters with thousands of GPUs or AWS Trainium chips, reducing training time and cost by 58%. Stability AI will also collaborate with AWS to enable students, researchers, startups, and enterprises around the world to use its open-source tools and models. "Our mission at Stability AI is to build the foundation to activate humanity's potential through AI. AWS has been an integral partner in scaling our open-source foundation models across modalities, and we are delighted to bring these to SageMaker to enable tens of thousands of developers and millions of users to take advantage of them. We look forward to seeing the amazing things built on these models and helping our customers customize and scale their models and solutions."


Train ML models - Azure Machine Learning

#artificialintelligence

Azure Machine Learning provides multiple ways to submit ML training jobs. In this article, you learn how to submit jobs using several supported methods. Note that SDK v2 is currently in public preview. The preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.